    Bagging ensemble selection for regression

    Bagging ensemble selection (BES) is a relatively new ensemble learning strategy. The strategy can be seen as an ensemble of the ensemble selection from libraries of models (ES) strategy. Previous experimental results on binary classification problems have shown that, using random trees as base classifiers, BES-OOB (the most successful variant of BES) is competitive with, and in many cases superior to, other ensemble learning strategies such as the original ES algorithm, stacking with linear regression, random forests, and boosting. Motivated by the promising results in classification, this paper examines the predictive performance of the BES-OOB strategy for regression problems. Our results show that the BES-OOB strategy outperforms stochastic gradient boosting and bagging when using regression trees as the base learners. Our results also suggest that the advantage of using a diverse model library becomes clear when the model library is relatively large. We also present encouraging results indicating that the non-negative least squares algorithm is a viable approach for pruning an ensemble of ensembles.
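A minimal sketch of the non-negative least squares pruning idea mentioned in the abstract, on toy held-out predictions (this illustrates NNLS-based ensemble weighting in general, not the authors' BES-OOB implementation):

```python
import numpy as np
from scipy.optimize import nnls

# Toy setup: predictions of 5 base regressors on 100 held-out (OOB-like) points.
rng = np.random.default_rng(0)
y = rng.normal(size=100)                       # held-out targets
P = y[:, None] + rng.normal(scale=[0.1, 0.5, 1.0, 1.5, 2.0], size=(100, 5))

# NNLS finds non-negative weights minimising ||P @ w - y||_2;
# members whose weight is driven to ~0 are effectively pruned.
w, _ = nnls(P, y)
w = w / w.sum()                                # normalise to a convex combination

pruned = P @ w                                 # weighted ensemble prediction
print("weights:", np.round(w, 3))
print("ensemble RMSE:", np.sqrt(np.mean((pruned - y) ** 2)))
```

Because the weights are constrained to be non-negative, the fit cannot cancel one member against another, which is what makes near-zero weights a usable pruning signal.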

    Sec-Lib: Protecting Scholarly Digital Libraries From Infected Papers Using Active Machine Learning Framework

    Researchers from academia and the corporate sector rely on scholarly digital libraries to access articles. Attackers take advantage of innocent users who consider the articles' files safe and thus open PDF files with little concern. In addition, researchers consider scholarly libraries a reliable, trusted, and untainted corpus of papers. For these reasons, scholarly digital libraries are an attractive target and inadvertently support the proliferation of cyber-attacks launched via malicious PDF files. In this study, we present related vulnerabilities and malware distribution approaches that exploit the vulnerabilities of scholarly digital libraries. We evaluated over two million scholarly papers in the CiteSeerX library and found the library to be contaminated with a surprisingly large number (0.3-2%) of malicious PDF documents (over 55% were crawled from the IPs of US universities). We developed a two-layered detection framework aimed at enhancing the detection of malicious PDF documents, Sec-Lib, which offers a security solution for large digital libraries. Sec-Lib includes a deterministic layer for detecting known malware and a machine-learning-based layer for detecting unknown malware. Our evaluation showed that scholarly digital libraries can detect 96.9% of malware with Sec-Lib, while minimizing the number of PDF files requiring labeling and thus reducing the manual inspection efforts of security experts by 98%.
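The two-layered screening idea described above can be sketched as follows: a deterministic hash-lookup layer for known malware, followed by a second layer for unknown files. Here a trivial keyword heuristic stands in for Sec-Lib's trained machine-learning classifier, and all names, signatures, and sample inputs are illustrative:

```python
import hashlib

# Layer-1 signature set: hashes of previously identified malicious files (toy data).
KNOWN_BAD = {hashlib.sha256(b"known-bad-sample").hexdigest()}

def deterministic_layer(data: bytes) -> bool:
    """Layer 1: flag files whose hash matches a known-malware signature."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD

def ml_layer(data: bytes) -> bool:
    """Layer-2 stand-in: a trivial heuristic in place of a trained classifier.
    Flags PDFs that embed JavaScript, a common malicious indicator."""
    return b"/JavaScript" in data

def screen(data: bytes) -> str:
    """Run the cheap deterministic check first; escalate unknowns to layer 2."""
    if deterministic_layer(data):
        return "known-malware"
    if ml_layer(data):
        return "suspicious"
    return "clean"

print(screen(b"known-bad-sample"))              # known-malware
print(screen(b"%PDF-1.7 ... /JavaScript ..."))  # suspicious
print(screen(b"%PDF-1.7 benign content"))       # clean
```

Ordering the layers this way means the expensive classifier only sees files that pass the exact-match check, which is the usual motivation for such cascades.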

    Scholarly digital libraries as a platform for malware distribution

    Researchers from academic institutions and the corporate sector rely heavily on scholarly digital libraries for accessing journal articles and conference proceedings. These documents are primarily downloaded as PDF files, and there is a risk that they may be compromised by attackers. PDF files have many capabilities that have been widely used for malicious operations. Attackers increasingly take advantage of innocent users who open PDF files with little or no concern, mistakenly considering these files safe and relatively non-threatening. Researchers also consider scholarly digital libraries reliable, home to a trusted corpus of papers, and untainted by malicious files. For these reasons, scholarly digital libraries are an attractive target for cyber-attacks launched via PDF files. In this study, we present several vulnerabilities and practical distribution attack approaches tailored to scholarly digital libraries. To support our claim regarding the attractiveness of scholarly digital libraries as an attack platform, we evaluated more than two million scholarly papers in the CiteSeerX library, collected over 8 years, and found it to be contaminated with a surprisingly large number (0.3%-2%) of malicious scholarly PDF documents originating from 46 different countries worldwide. More than 55% of the malicious papers in CiteSeerX were crawled from IPs belonging to US universities, followed by those belonging to Europe (33.6%). We show how existing scholarly digital libraries can easily be leveraged as a distribution platform, both for a targeted attack and at a global scale. On average, each malicious paper caused high-impact damage, being downloaded 167 times over 5 years by researchers from different countries worldwide. Overall, the USA and Asia downloaded the most malicious scholarly papers, 40.15% and 27.9%, respectively. The most downloaded malicious scholarly document is a malicious version of a popular paper in the computer forensics domain, with 2213 downloads spanning 108 different countries. Finally, we suggest several concrete solutions for mitigating such attacks, including simple deterministic solutions as well as advanced machine-learning-based frameworks.

    Computational complexity analysis of decision tree algorithms

    The decision tree is a simple but powerful learning technique, considered one of the most popular learning algorithms, and has been used successfully in practice for various classification tasks. Decision trees have the advantage of producing a comprehensible classification model with satisfactory accuracy levels in several application domains. In recent years, the volume of data available for learning has been increasing dramatically. As a result, many application domains face large amounts of data, posing a major bottleneck on the computability of learning techniques. There are different implementations of the decision tree using different techniques. In this paper, we theoretically and experimentally study and compare the computational complexity of the most common classical top-down decision tree algorithms (C4.5 and CART). This work can serve as part of a review analysing the computational complexity of existing decision tree classifier algorithms, with the aim of understanding their operational steps and optimizing the learning algorithm for large datasets.
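For sorting-based top-down induction such as CART, the dominant per-level cost is sorting each of m numeric features over n instances, which gives the commonly cited O(m·n·log n) term. A rough numeric illustration of that bound (not the paper's experiments):

```python
import math

def per_level_cost(n: int, m: int) -> float:
    """Comparisons to sort m features over n instances: the dominant term
    in CART-style split search at one level of the tree."""
    return m * n * math.log2(n)

# Doubling the number of instances multiplies the cost by slightly more
# than 2 (the n log n factor); doubling features scales it by exactly 2.
for n in (1_000, 2_000, 4_000):
    print(n, round(per_level_cost(n, m=20)))
```

The super-linear growth in n is why presorting or histogram-based split search is a common optimization for large datasets.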

    An ant colony-based semi-supervised approach for learning classification rules

    Semi-supervised learning methods create models from a few labeled instances and a large number of unlabeled instances. They are a good option in scenarios where there is a lot of unlabeled data and the process of labeling instances is expensive, as is the case for most Web applications. This paper proposes a semi-supervised self-training algorithm called Ant-Labeler. Self-training algorithms take advantage of supervised learning algorithms to iteratively learn a model from the labeled instances and then use this model to classify unlabeled instances. The instances that receive labels with high confidence are moved from the unlabeled to the labeled set, and this process is repeated until a stopping criterion is met, such as labeling all unlabeled instances. Ant-Labeler uses an ACO (ant colony optimization) algorithm as the supervised learning method in the self-training procedure to generate interpretable rule-based models, which are used as an ensemble to ensure accurate predictions. The pheromone matrix is reused across different executions of the ACO algorithm to avoid rebuilding the models from scratch every time the labeled set is updated. Results showed that the proposed algorithm obtains better predictive accuracy than three state-of-the-art algorithms in roughly half of the datasets on which it was tested, and that the smaller the number of labeled instances, the better Ant-Labeler performs.
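The generic self-training loop the abstract describes can be sketched as follows, with a simple nearest-centroid classifier standing in for Ant-Labeler's ACO rule learner and a distance margin standing in for its confidence measure (both are illustrative substitutions, not the paper's method):

```python
import numpy as np

# Two labeled seed points and 40 unlabeled points from two toy clusters.
rng = np.random.default_rng(1)
X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])
y_lab = np.array([0, 1])
X_unl = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

def fit_predict(Xl, yl, Xu):
    """Nearest-centroid classifier; the distance margin acts as confidence."""
    cents = np.array([Xl[yl == c].mean(axis=0) for c in (0, 1)])
    d = np.linalg.norm(Xu[:, None, :] - cents[None, :, :], axis=2)
    return d.argmin(axis=1), np.abs(d[:, 0] - d[:, 1])

# Self-training: repeatedly label the unlabeled pool, promote the most
# confident half, and retrain, until no unlabeled instances remain.
while len(X_unl):
    pred, conf = fit_predict(X_lab, y_lab, X_unl)
    keep = conf >= np.quantile(conf, 0.5)
    X_lab = np.vstack([X_lab, X_unl[keep]])
    y_lab = np.append(y_lab, pred[keep])
    X_unl = X_unl[~keep]

print("labelled instances:", len(y_lab))
```

Promoting only the most confident predictions each round is what keeps early labeling mistakes from being amplified, which is the core risk in any self-training scheme.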

    Multiple Imputation Ensembles (MIE) for dealing with missing data

    Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely At Random. First, we use a number of single/multiple imputation methods to recover the missing values, and then we ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, combining multiple imputation with ensemble techniques, outperforms the others, particularly as the amount of missing data increases.
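A minimal sketch of the MIE idea on synthetic data: impute the missing values several times, fit one model per imputed copy, and combine the models by majority vote. The stochastic mean imputer and nearest-centroid learner below are stand-ins for illustration, not the paper's components:

```python
import numpy as np

# Two separable toy classes; then knock out 20% of values (MCAR).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(4, 1, (30, 3))])
y = np.repeat([0, 1], 30)
X[rng.random(X.shape) < 0.2] = np.nan

def impute(X, rng):
    """One stochastic imputation: column mean plus Gaussian noise, so each
    imputed copy differs (the 'multiple' in multiple imputation)."""
    Xi = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        mu, sd = np.nanmean(X[:, j]), np.nanstd(X[:, j])
        Xi[miss, j] = rng.normal(mu, sd, miss.sum())
    return Xi

def centroid_predict(Xtr, ytr, Xte):
    """Stand-in base learner: classify by the nearer class centroid."""
    cents = np.array([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    return np.linalg.norm(Xte[:, None] - cents[None], axis=2).argmin(axis=1)

# Fit one model per imputed copy and take a majority vote over 5 copies.
votes = np.array([centroid_predict(Xi := impute(X, rng), y, Xi)
                  for _ in range(5)])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```

Averaging over several imputations means no single guess at a missing value can dominate the final prediction, which is the intuition behind combining imputation with ensembling.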

    "Trapped in an Empty Waiting Room" – The Existential Human Core of Loneliness in Old Age: A Meta-Synthesis

    Loneliness in old age has a negative influence on quality of life, health, and survival. To understand the phenomenon of loneliness in old age, the voices of lonely older adults should be heard. Therefore, the purpose of this meta-synthesis was to synthesize scientific studies of older adults' experiences of loneliness. Eleven qualitative articles that met the inclusion criteria were analyzed and synthesized according to Noblit and Hare's meta-ethnographic approach. The analysis revealed the overriding meaning of the existential human core of loneliness in old age, expressed through the metaphor "trapped in an empty waiting room". Four interwoven themes were found: 1) the negative emotions of loneliness, 2) the loss of meaningful interpersonal relationships, 3) the influence of loneliness on self-perception, and 4) the older adults' endeavors to deal with loneliness. The joint contribution of family members, health care providers, and volunteers is necessary to break the vicious circle of loneliness.